Name: Sung-Jen, Yen
Course: BUDT704
Section: 0502
Date: 11/15/2023
As a member of Prince George's County, the University of Maryland nurtures thousands of students every day, providing them meals while they are at school. Therefore, it is the responsibility of both the university and Prince George's County to monitor the food we eat and keep it sanitary.
The dataset from Prince George's County Health Department gives us valuable information on how the department inspects food establishments and investigates complaints. The department provides several services to residents through the Food Protection/Policy Program.
The dataset accumulates inspection records from March 6, 2019 through November 3, 2023, with 44,600 rows and 29 columns, where each row is one food inspection. We will focus mainly on the records categorized as restaurants, since these food providers interact the most with students on a daily basis.
Let's first take a look at the dataset to determine whether it needs any preprocessing.
# Import data processing libraries
import numpy as np
import pandas as pd
# Import dataset from csv file
df = pd.read_csv('HW6_Food Inspections Nov 2023.csv')
# Count the rows and columns
print(df.shape)
# Take a look at the dataset
df.head()
(44597, 29)
| Establishment_id | Name | Category | Inspection_date | Inspection_results | City | State | Zip | Address_line_1 | Address_line_2 | ... | Hot_and_cold_running_water_provided | Proper_sewage_disposal | No_bare_hand_contact | Adequate_hand_washing_facilities | Rodent_and_insects | Food_contact_surfaces_and_equipment | Inspection_type | Owner | Type | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 900 | LONGHORN STEAKHOUSE OF BOWIE 5174 | Restaurant | 05/13/2020 12:00:00 AM | ------ | BOWIE | MD | 20716.0 | 4100 TOWN CENTER BLVD | ------ | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | NaN | Colleen Lyons | Food Establishments | POINT (-76.733979 38.945787) |
| 1 | 16073 | Anarkali Bazar | Grocery Store | 03/24/2020 12:00:00 AM | Non-Compliant - Violations Observed | GREENBELT | MD | 20770.0 | 6106 GREENBELT RD | ------ | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | Pre-Opening Inspection | YASIR RANA | Food Establishments | POINT (-76.908332 38.997396) |
| 2 | 1207 | PUPUSERIA MORAZAN | Fast Food - Local | 05/11/2020 12:00:00 AM | ------ | BRENTWOOD | MD | 20722.0 | 4309 BLADENSBURG RD | ------ | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | Food Complaint | NAPOLEON CASTRO | Food Establishments | POINT (-76.944736 38.937661) |
| 3 | 3412 | THE TENDER RIB & CATERING CO | Fast Food - Local | 05/09/2020 12:00:00 AM | ------ | SUITLAND | MD | 20746.0 | 4809 ALLENTOWN RD | ------ | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | NaN | CYNTHIA WALLACE | Food Establishments | POINT (-76.889999 38.818813) |
| 4 | 16073 | Anarkali Bazar | Grocery Store | 03/31/2020 12:00:00 AM | Compliance Schedule - Completed | GREENBELT | MD | 20770.0 | 6106 GREENBELT RD | ------ | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | Re-inspection | YASIR RANA | Food Establishments | POINT (-76.908332 38.997396) |
5 rows × 29 columns
There are a total of 44,597 rows and 29 columns, which matches the dataset description on Prince George's County's website. However, some rows appear to have missing values. Let's check a summary for the whole dataset.
# Take a look at the summary description of missing values in the dataset
df.isnull().sum()
Establishment_id                          0
Name                                      0
Category                               1007
Inspection_date                           0
Inspection_results                        0
City                                     24
State                                     0
Zip                                       3
Address_line_1                            3
Address_line_2                            0
Food_from_approved_source                 0
Food_protected_from_contamination         0
Ill_workers_restricted                    0
Proper_hand_washing                       0
Cooling_time_and_temperature              0
Cold_holding_temperature                  0
Hot_holding_temperature                   0
Cooking_time_and_temperature              0
Reheating_time_and_temperature            0
Hot_and_cold_running_water_provided       0
Proper_sewage_disposal                    0
No_bare_hand_contact                      0
Adequate_hand_washing_facilities          0
Rodent_and_insects                        0
Food_contact_surfaces_and_equipment       0
Inspection_type                        2816
Owner                                    53
Type                                      0
Location                                190
dtype: int64
From the table, we see that the Category, City, Zip, Address_line_1, Inspection_type, Owner, and Location columns have missing values.
However, we should also remember from our earlier look at the dataset that the Inspection_results and Address_line_2 columns use '------' to denote missing values. We need to convert those '------' entries to NaN so that we don't run into errors when we process the data later.
# Replace '------' to NaN value
df.replace('------', np.nan, inplace=True)
# Check if the replacement succeeded
df.head()
| Establishment_id | Name | Category | Inspection_date | Inspection_results | City | State | Zip | Address_line_1 | Address_line_2 | ... | Hot_and_cold_running_water_provided | Proper_sewage_disposal | No_bare_hand_contact | Adequate_hand_washing_facilities | Rodent_and_insects | Food_contact_surfaces_and_equipment | Inspection_type | Owner | Type | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 900 | LONGHORN STEAKHOUSE OF BOWIE 5174 | Restaurant | 05/13/2020 12:00:00 AM | NaN | BOWIE | MD | 20716.0 | 4100 TOWN CENTER BLVD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | NaN | Colleen Lyons | Food Establishments | POINT (-76.733979 38.945787) |
| 1 | 16073 | Anarkali Bazar | Grocery Store | 03/24/2020 12:00:00 AM | Non-Compliant - Violations Observed | GREENBELT | MD | 20770.0 | 6106 GREENBELT RD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | Pre-Opening Inspection | YASIR RANA | Food Establishments | POINT (-76.908332 38.997396) |
| 2 | 1207 | PUPUSERIA MORAZAN | Fast Food - Local | 05/11/2020 12:00:00 AM | NaN | BRENTWOOD | MD | 20722.0 | 4309 BLADENSBURG RD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | Food Complaint | NAPOLEON CASTRO | Food Establishments | POINT (-76.944736 38.937661) |
| 3 | 3412 | THE TENDER RIB & CATERING CO | Fast Food - Local | 05/09/2020 12:00:00 AM | NaN | SUITLAND | MD | 20746.0 | 4809 ALLENTOWN RD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | NaN | CYNTHIA WALLACE | Food Establishments | POINT (-76.889999 38.818813) |
| 4 | 16073 | Anarkali Bazar | Grocery Store | 03/31/2020 12:00:00 AM | Compliance Schedule - Completed | GREENBELT | MD | 20770.0 | 6106 GREENBELT RD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | Re-inspection | YASIR RANA | Food Establishments | POINT (-76.908332 38.997396) |
5 rows × 29 columns
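As a side note, the same cleanup can happen at load time. A minimal sketch (using an in-memory toy CSV, since the real file is not reproduced here):

```python
import io
import pandas as pd

# Toy CSV standing in for the real file; '------' marks a missing value
raw = io.StringIO("Name,Inspection_results\nLONGHORN,------\nAnarkali,Non-Compliant")

# na_values tells read_csv to parse '------' as NaN at load time,
# so the separate replace step above is not needed
df_demo = pd.read_csv(raw, na_values=['------'])
```

Doing this at read time keeps the raw and cleaned representations from ever coexisting in memory.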
Before we perform any analysis, we should see whether any categorical columns need transformation so that we can extract more information from our data.
Since we focus mainly on restaurant records, let's first define which categories count as 'restaurant'.
# Show each category from the food inspection dataset
df['Category'].unique()
array(['Restaurant', 'Grocery Store', 'Fast Food - Local', 'Carry-out',
'Convenience Store', 'Seafood', 'Fast Food - Chain', nan,
'Institution', 'Grocery Store- Medium', 'Meat/Poultry Market',
'Gas Station Store', 'Membership Warehouse', 'Ice Cream', 'Casino',
'Public School', 'Coffee Shop', '@Full Service', 'Specialty Store',
'Hotel', 'Dollar Store', 'Fast Food', 'Full Service',
'Private School', 'Stadium/Amusement Park', 'Bakery',
'Multiple Facilities', 'College/University', 'Catering Only',
'Snack Bar/Concession Stand', 'Limited Service',
'Pre-Packaged Only', 'Deli', 'Bar/Tavern/Lounge',
'Health Care Facility', 'Buffet', 'After School Supper Program',
'Private Club', 'Cafeteria', 'Night Club', 'Grocery Store - Large',
'Church/Temple/Mosque', 'Fire/Community Hall',
'Banquet Hall/Ballroom', 'Senior Nutrition Program',
'Diet/Nutrition Site', 'Bakery/Catering', '@Fast Food-Do Not Use',
'Full Service/Catering', 'Pizza', 'Delivery Only', 'Group Home',
'@Fast Food', 'Farm Market', 'Micro Market',
'@Full Service-Do Not Use', 'Diner', 'Excluded', 'Donut',
"Farmer's Market", 'B & B', 'School'], dtype=object)
From the list, we will select only the facility types where students most commonly have meals. Therefore, we will not include grocery stores, markets, or other options that are normally not around student activity areas. We also exclude bakeries, coffee shops, and snack bars/concession stands because they are not restaurants in the common sense.
# Define the categories that count as 'restaurant'
mask = ['Restaurant', 'Fast Food - Local', 'Carry-out', 'Seafood', 'Fast Food - Chain',
'Fast Food', 'College/University', 'Catering Only', 'Deli', 'Bar/Tavern/Lounge',
'Buffet', 'Cafeteria', 'Pizza', 'Delivery Only', '@Fast Food', 'Diner', 'Donut',
'B & B', 'School']
# Create a dataset that only contains restaurant inspection records
df_rest = df[df['Category'].isin(mask)]
df_rest.head()
| Establishment_id | Name | Category | Inspection_date | Inspection_results | City | State | Zip | Address_line_1 | Address_line_2 | ... | Hot_and_cold_running_water_provided | Proper_sewage_disposal | No_bare_hand_contact | Adequate_hand_washing_facilities | Rodent_and_insects | Food_contact_surfaces_and_equipment | Inspection_type | Owner | Type | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 900 | LONGHORN STEAKHOUSE OF BOWIE 5174 | Restaurant | 05/13/2020 12:00:00 AM | NaN | BOWIE | MD | 20716.0 | 4100 TOWN CENTER BLVD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | NaN | Colleen Lyons | Food Establishments | POINT (-76.733979 38.945787) |
| 2 | 1207 | PUPUSERIA MORAZAN | Fast Food - Local | 05/11/2020 12:00:00 AM | NaN | BRENTWOOD | MD | 20722.0 | 4309 BLADENSBURG RD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | Food Complaint | NAPOLEON CASTRO | Food Establishments | POINT (-76.944736 38.937661) |
| 3 | 3412 | THE TENDER RIB & CATERING CO | Fast Food - Local | 05/09/2020 12:00:00 AM | NaN | SUITLAND | MD | 20746.0 | 4809 ALLENTOWN RD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | NaN | CYNTHIA WALLACE | Food Establishments | POINT (-76.889999 38.818813) |
| 5 | 900 | LONGHORN STEAKHOUSE OF BOWIE 5174 | Restaurant | 05/13/2020 12:00:00 AM | NaN | BOWIE | MD | 20716.0 | 4100 TOWN CENTER BLVD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | Food Complaint | Colleen Lyons | Food Establishments | POINT (-76.733979 38.945787) |
| 6 | 1207 | PUPUSERIA MORAZAN | Fast Food - Local | 05/11/2020 12:00:00 AM | NaN | BRENTWOOD | MD | 20722.0 | 4309 BLADENSBURG RD | NaN | ... | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | NaN | NAPOLEON CASTRO | Food Establishments | POINT (-76.944736 38.937661) |
5 rows × 29 columns
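One detail of `isin` worth knowing is how it treats missing categories. A small sketch on toy data (the values below are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking the Category column, including a missing value
df_demo = pd.DataFrame({'Category': ['Restaurant', 'Grocery Store', 'Pizza', 'Hotel', None]})
mask_demo = ['Restaurant', 'Pizza']

# isin returns False for NaN, so rows with a missing category drop out too
filtered = df_demo[df_demo['Category'].isin(mask_demo)]
```

This means the 1,007 rows with a missing Category are silently excluded from the restaurant subset, which is fine here since we cannot classify them anyway.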
Now we have the inspection records that contain only restaurants. Let's pivot the columns a little so that we can compute summary statistics later if needed.
First, we want to create columns for each restaurant category.
# Create new columns for each restaurant category
category = pd.get_dummies(df_rest['Category'])
# Concat these new columns to the restaurant dataset
df_rest = pd.concat([df_rest, category], axis=1)
# Check if new columns are created successfully
df_rest.head()
| Establishment_id | Name | Category | Inspection_date | Inspection_results | City | State | Zip | Address_line_1 | Address_line_2 | ... | Delivery Only | Diner | Donut | Fast Food | Fast Food - Chain | Fast Food - Local | Pizza | Restaurant | School | Seafood | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 900 | LONGHORN STEAKHOUSE OF BOWIE 5174 | Restaurant | 05/13/2020 12:00:00 AM | NaN | BOWIE | MD | 20716.0 | 4100 TOWN CENTER BLVD | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1207 | PUPUSERIA MORAZAN | Fast Food - Local | 05/11/2020 12:00:00 AM | NaN | BRENTWOOD | MD | 20722.0 | 4309 BLADENSBURG RD | NaN | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 3412 | THE TENDER RIB & CATERING CO | Fast Food - Local | 05/09/2020 12:00:00 AM | NaN | SUITLAND | MD | 20746.0 | 4809 ALLENTOWN RD | NaN | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 900 | LONGHORN STEAKHOUSE OF BOWIE 5174 | Restaurant | 05/13/2020 12:00:00 AM | NaN | BOWIE | MD | 20716.0 | 4100 TOWN CENTER BLVD | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 1207 | PUPUSERIA MORAZAN | Fast Food - Local | 05/11/2020 12:00:00 AM | NaN | BRENTWOOD | MD | 20722.0 | 4309 BLADENSBURG RD | NaN | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
5 rows × 48 columns
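An alternative worth noting: `get_dummies` can also take the whole frame with a `columns=` argument, which prefixes each indicator and removes the original column in one step. A sketch on toy data:

```python
import pandas as pd

# Toy frame with only the Category column
df_demo = pd.DataFrame({'Category': ['Restaurant', 'Pizza', 'Restaurant']})

# Passing columns= prefixes each indicator (e.g. 'Category_Pizza')
# and drops the original 'Category' column in one step
dummies = pd.get_dummies(df_demo, columns=['Category'])
```

The prefix avoids name clashes with existing columns, though here we kept the unprefixed names since we still want the original Category column around.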
Now that we have a column for each restaurant category, let's take a look at how each category performs in inspection results.
# Create summary statistics on inspection results (as proportions)
proportion = df_rest['Inspection_results'].value_counts(normalize=True)
# Show the share of inspection results that are not in compliance
non_compliance = proportion[~proportion.index.str.match('^Compliance|^Compliant')]
non_compliance
Critical Violations observed           0.370257
Non-Compliant - Violations Observed    0.230753
Facility Closed                        0.019793
Facility Reopened                      0.010238
No Critical Violations Observed        0.000341
Name: Inspection_results, dtype: float64
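One caveat: the prefix regex above also keeps labels like 'Facility Reopened' and 'No Critical Violations Observed', which arguably are not violations. A safer sketch uses an explicit label list (the toy proportions below are made-up numbers, and the compliant label is an assumption for illustration):

```python
import pandas as pd

# Toy proportions shaped like the value_counts output above
proportion_demo = pd.Series({
    'Compliant - No Violations Observed': 0.30,
    'Critical Violations observed': 0.37,
    'Non-Compliant - Violations Observed': 0.23,
    'Facility Reopened': 0.06,
    'No Critical Violations Observed': 0.04,
})

# Spell out exactly which labels denote non-compliance instead of
# pattern-matching on a prefix; 'Facility Reopened' and
# 'No Critical Violations Observed' are deliberately excluded
non_compliant_labels = ['Critical Violations observed',
                        'Non-Compliant - Violations Observed']
non_compliant_share = proportion_demo[proportion_demo.index.isin(non_compliant_labels)].sum()
```

Enumerating labels makes the definition of "non-compliant" explicit and reviewable, at the cost of needing an update if the department adds a new result label.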
From the table we can see that nearly 63% of inspection results are not in the compliant categories. Of all the non-compliant results, critical violations account for over half. This might influence whether consumers choose to dine at PG County's restaurants.
To be able to conduct summary statistics by year and month, let's create a new date column from the original inspection date column.
# Slicing year and month information into the new column
df_rest['Date'] = df_rest['Inspection_date'].str.extract(r'(\d{4})') + '/' + df_rest['Inspection_date'].str.extract(r'(^\d{2})')
# Convert the date column from string to datetime format for later analysis
df_rest['Date'] = pd.to_datetime(df_rest['Date'])
# Create a new dataset that contains the inspection date time series data
df_trans = df_rest.sort_values(by='Date')
# Check the latest data entries and see if they match the data source description
df_trans.tail()
| Establishment_id | Name | Category | Inspection_date | Inspection_results | City | State | Zip | Address_line_1 | Address_line_2 | ... | Diner | Donut | Fast Food | Fast Food - Chain | Fast Food - Local | Pizza | Restaurant | School | Seafood | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44578 | 794 | KENNYS SUB SHOP | Carry-out | 11/02/2023 12:00:00 AM | NaN | GREENBELT | MD | 20770.0 | 8823 GREENBELT RD | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2023-11-01 |
| 44591 | 8408 | ATOMIC WINGS | Fast Food - Local | 11/02/2023 12:00:00 AM | Non-Compliant - Violations Observed | HYATTSVILLE | MD | 20782.0 | 3124 Queens Chapel RD | NaN | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2023-11-01 |
| 44592 | 1211 | QUEENSWAY RESTAURANT | Fast Food - Local | 11/02/2023 12:00:00 AM | Non-Compliant - Violations Observed | RIVERDALE | MD | 20737.0 | 5851 RIVERDALE RD | NaN | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2023-11-01 |
| 44571 | 6154 | AUNTIE ANNE'S PLANET SMOOTHIE | Fast Food - Chain | 11/01/2023 12:00:00 AM | Critical Violations observed | OXON HILL | MD | 20745.0 | 6800 OXON HILL RD | 895 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2023-11-01 |
| 8645 | 8128 | WINGSTOP | Fast Food - Chain | 05/17/2026 12:00:00 AM | Critical Violations observed | LAUREL | MD | 20707.0 | 14623 BALTIMORE AVE | NaN | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2026-05-01 |
5 rows × 49 columns
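The two regex extractions above can also be replaced with a single datetime round-trip. A sketch (assuming the same `%m/%d/%Y %I:%M:%S %p` timestamp format seen in the table):

```python
import pandas as pd

# Two timestamps in the dataset's format
dates = pd.Series(['05/13/2020 12:00:00 AM', '11/02/2023 12:00:00 AM'])

# Parse the full timestamp, then truncate each date to the first of its
# month; this replaces the two regex extractions with one chained call
month_start = (pd.to_datetime(dates, format='%m/%d/%Y %I:%M:%S %p')
               .dt.to_period('M').dt.to_timestamp())
```

Parsing with an explicit format also fails loudly on malformed dates, whereas a regex extraction would quietly produce NaN.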
There shouldn't be any record with an inspection date later than November 2023, so we should drop that row.
# Drop the row that falls outside the data's recorded time span
df_trans.drop(df_trans.index[-1], inplace=True)
# Check that the drop succeeded
df_trans.tail()
| Establishment_id | Name | Category | Inspection_date | Inspection_results | City | State | Zip | Address_line_1 | Address_line_2 | ... | Diner | Donut | Fast Food | Fast Food - Chain | Fast Food - Local | Pizza | Restaurant | School | Seafood | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44582 | 1467 | SUBWAY SANDWICH SHOP | Fast Food - Chain | 11/02/2023 12:00:00 AM | NaN | GREENBELT | MD | 20770.0 | 6094 GREENBELT RD | NaN | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2023-11-01 |
| 44578 | 794 | KENNYS SUB SHOP | Carry-out | 11/02/2023 12:00:00 AM | NaN | GREENBELT | MD | 20770.0 | 8823 GREENBELT RD | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2023-11-01 |
| 44591 | 8408 | ATOMIC WINGS | Fast Food - Local | 11/02/2023 12:00:00 AM | Non-Compliant - Violations Observed | HYATTSVILLE | MD | 20782.0 | 3124 Queens Chapel RD | NaN | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2023-11-01 |
| 44592 | 1211 | QUEENSWAY RESTAURANT | Fast Food - Local | 11/02/2023 12:00:00 AM | Non-Compliant - Violations Observed | RIVERDALE | MD | 20737.0 | 5851 RIVERDALE RD | NaN | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2023-11-01 |
| 44571 | 6154 | AUNTIE ANNE'S PLANET SMOOTHIE | Fast Food - Chain | 11/01/2023 12:00:00 AM | Critical Violations observed | OXON HILL | MD | 20745.0 | 6800 OXON HILL RD | 895 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2023-11-01 |
5 rows × 49 columns
Now that we have the correct date column, we might want to know the violations for each restaurant. It will be much easier to count violations if we change each compliance result to 0 or 1, where 0 means in compliance and 1 means out of compliance.
# Check that every column only has 'In Compliance' and 'Out of Compliance' as categorical values
df_trans.loc[:, 'Food_from_approved_source':'Food_contact_surfaces_and_equipment'].describe()
| Food_from_approved_source | Food_protected_from_contamination | Ill_workers_restricted | Proper_hand_washing | Cooling_time_and_temperature | Cold_holding_temperature | Hot_holding_temperature | Cooking_time_and_temperature | Reheating_time_and_temperature | Hot_and_cold_running_water_provided | Proper_sewage_disposal | No_bare_hand_contact | Adequate_hand_washing_facilities | Rodent_and_insects | Food_contact_surfaces_and_equipment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 | 21385 |
| unique | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| top | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance | In Compliance |
| freq | 21342 | 20513 | 21371 | 20578 | 21160 | 17259 | 19532 | 21295 | 21315 | 21230 | 21340 | 21323 | 21041 | 17687 | 17068 |
# Switch 'In Compliance' to 0 and 'Out of Compliance' to 1
df_trans.replace({'In Compliance': 0, 'Out of Compliance': 1}, inplace=True)
# Check if the values were replaced successfully
df_trans.loc[:, 'Food_from_approved_source':'Food_contact_surfaces_and_equipment'].head()
| Food_from_approved_source | Food_protected_from_contamination | Ill_workers_restricted | Proper_hand_washing | Cooling_time_and_temperature | Cold_holding_temperature | Hot_holding_temperature | Cooking_time_and_temperature | Reheating_time_and_temperature | Hot_and_cold_running_water_provided | Proper_sewage_disposal | No_bare_hand_contact | Adequate_hand_washing_facilities | Rodent_and_insects | Food_contact_surfaces_and_equipment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29062 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16878 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10110 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27477 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11170 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Create a new column to count the number of violations
df_trans['Number_of_violation'] = df_trans.loc[:, 'Food_from_approved_source':'Food_contact_surfaces_and_equipment'].sum(axis=1)
# Check if the column is created successfully
df_trans.head()
| Establishment_id | Name | Category | Inspection_date | Inspection_results | City | State | Zip | Address_line_1 | Address_line_2 | ... | Donut | Fast Food | Fast Food - Chain | Fast Food - Local | Pizza | Restaurant | School | Seafood | Date | Number_of_violation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29062 | 1088 | PANDA EXPRESS 1793 | Fast Food | 07/21/2011 12:00:00 AM | NaN | BRANDYWINE | MD | 20613.0 | 15902 CRAIN HWY | NaN | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2011-07-01 | 0 |
| 16878 | 43 | ANDREWS RESTAURANT | Restaurant | 07/13/2011 12:00:00 AM | NaN | SUITLAND | MD | 20746.0 | 6407 SUITLAND RD | NaN | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2011-07-01 | 2 |
| 10110 | 561 | GAYLORD NATIONAL - OLD HICKORY STEAKHOUSE | Restaurant | 07/07/2011 12:00:00 AM | NaN | OXON HILL | MD | 20745.0 | 201 WATERFRONT ST | NaN | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2011-07-01 | 1 |
| 27477 | 1867 | EDIBLE ARRANGEMENTS | Catering Only | 07/20/2011 12:00:00 AM | NaN | LAUREL | MD | 20707.0 | 13600 BALTIMORE AVE | 204 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2011-07-01 | 0 |
| 11170 | 289 | CHICK-FIL-A | Fast Food | 07/21/2011 12:00:00 AM | NaN | BRANDYWINE | MD | 20613.0 | 16003 CRAIN HWY | NaN | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2011-07-01 | 0 |
5 rows × 50 columns
# Create a new indexed dataset containing only the Establishment_id, Category, Inspection_date, and Number_of_violation columns
df_wide = df_trans[['Establishment_id', 'Category', 'Inspection_date', 'Number_of_violation']]
df_wide = df_wide.set_index(['Establishment_id', 'Category']).sort_index()
df_wide
| Inspection_date | Number_of_violation | ||
|---|---|---|---|
| Establishment_id | Category | ||
| 2 | Carry-out | 07/19/2011 12:00:00 AM | 2 |
| Carry-out | 11/20/2012 12:00:00 AM | 1 | |
| Carry-out | 12/03/2012 12:00:00 AM | 0 | |
| Carry-out | 04/15/2014 12:00:00 AM | 0 | |
| Carry-out | 10/22/2014 12:00:00 AM | 1 | |
| ... | ... | ... | ... |
| 16291 | Restaurant | 06/16/2023 12:00:00 AM | 1 |
| 16309 | Fast Food - Chain | 12/02/2021 12:00:00 AM | 1 |
| Fast Food - Chain | 06/28/2022 12:00:00 AM | 0 | |
| Fast Food - Chain | 08/11/2022 12:00:00 AM | 0 | |
| 16837 | Fast Food - Chain | 04/06/2023 12:00:00 AM | 0 |
21385 rows × 2 columns
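The point of the MultiIndex view is that level-based aggregation becomes trivial. A sketch on toy data shaped like `df_wide` (the values are made up for illustration):

```python
import pandas as pd

# Toy long table shaped like df_wide above
df_demo = pd.DataFrame({
    'Establishment_id': [2, 2, 2, 5],
    'Category': ['Carry-out', 'Carry-out', 'Carry-out', 'Pizza'],
    'Number_of_violation': [2, 1, 0, 1],
})

# Once the MultiIndex is set, level-based aggregation is a one-liner,
# e.g. the average number of violations per establishment
indexed = df_demo.set_index(['Establishment_id', 'Category']).sort_index()
per_est = indexed.groupby(level='Establishment_id')['Number_of_violation'].mean()
```

The same pattern works with `level='Category'` to compare categories instead of establishments.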
Now that we have successfully pivoted our columns, we can start our analysis.
First and foremost, we want to know which category of violation happens the most. This is important because we want to help PG County find the top violations so that they can tackle the sanitation problems that draw the most complaints.
# Count the frequency of each violation category
violation = df_trans.loc[:, 'Food_from_approved_source':'Food_contact_surfaces_and_equipment'].sum().sort_values(ascending=False)
violation
Food_contact_surfaces_and_equipment    4317
Cold_holding_temperature               4126
Rodent_and_insects                     3698
Hot_holding_temperature                1853
Food_protected_from_contamination       872
Proper_hand_washing                     807
Adequate_hand_washing_facilities        344
Cooling_time_and_temperature            225
Hot_and_cold_running_water_provided     155
Cooking_time_and_temperature             90
Reheating_time_and_temperature           70
No_bare_hand_contact                     62
Proper_sewage_disposal                   45
Food_from_approved_source                43
Ill_workers_restricted                   14
dtype: int64
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Create a visualization of the frequency distribution
axis = sns.barplot(x=violation.values, y=violation.index, color='blue', orient='h')
axis.set_title('Frequency Distribution of Violations', fontdict={'fontsize': 20, 'fontweight': 700, 'color': 'maroon'}, pad=20)
plt.xlabel('Count of Violations')
plt.xticks(rotation='horizontal')
for p in axis.patches:
width = p.get_width()
axis.annotate(f'{width:.0f}',xy=(width, p.get_y() + p.get_height() / 2), ha='left', va='center')
From the chart, we can see that Food_contact_surfaces_and_equipment, Cold_holding_temperature, and Rodent_and_insects are the most common violations. It is clear that unhygienic food-handling environments are prevalent among restaurants. This is concerning, since food spoils easily at improper holding temperatures or on unhygienic surfaces where bacteria love to grow.
Therefore, we would suggest PG County put effort into enforcing equipment temperature checks and kitchen sanitizing on a routine basis.
Now we know the most common violations. Let's see how the violation rate has changed over time. We want to know the proportion of inspections with at least one violation by month and year, and see whether there is a trend of inspections getting stricter or more lenient over time.
# Create a new dataset containing only the inspections with at least one violation
df_trans2 = df_trans.loc[df_trans['Number_of_violation'] >= 1, ['Date', 'Number_of_violation']]
# Calculate the violation rate for each month of each year
percentage = df_trans2.groupby('Date')['Number_of_violation'].count() / df_trans[['Date', 'Number_of_violation']].groupby('Date')['Number_of_violation'].count()
percentage
Date
2011-07-01 0.512195
2011-08-01 0.389831
2011-09-01 0.295455
2011-10-01 0.304348
2011-11-01 0.366667
...
2023-07-01 0.556818
2023-08-01 0.442105
2023-09-01 1.000000
2023-10-01 0.350877
2023-11-01 0.200000
Name: Number_of_violation, Length: 149, dtype: float64
# Transform the dataset to have separate year and month columns for data visualization
percentage = percentage.to_frame()
percentage.reset_index(inplace=True)
percentage['Year'] = percentage['Date'].dt.year
percentage['Month'] = percentage['Date'].dt.month
percentage
| Date | Number_of_violation | Year | Month | |
|---|---|---|---|---|
| 0 | 2011-07-01 | 0.512195 | 2011 | 7 |
| 1 | 2011-08-01 | 0.389831 | 2011 | 8 |
| 2 | 2011-09-01 | 0.295455 | 2011 | 9 |
| 3 | 2011-10-01 | 0.304348 | 2011 | 10 |
| 4 | 2011-11-01 | 0.366667 | 2011 | 11 |
| ... | ... | ... | ... | ... |
| 144 | 2023-07-01 | 0.556818 | 2023 | 7 |
| 145 | 2023-08-01 | 0.442105 | 2023 | 8 |
| 146 | 2023-09-01 | 1.000000 | 2023 | 9 |
| 147 | 2023-10-01 | 0.350877 | 2023 | 10 |
| 148 | 2023-11-01 | 0.200000 | 2023 | 11 |
149 rows × 4 columns
# Import interactive data visualizations library
import plotly.express as px
# Create visualization for violation rate over time
fig = px.line(percentage, x="Month", y="Number_of_violation", color='Year', title='Restaurant with At Least One Violation by Month')
fig.update_xaxes(tickmode='array', tickvals=list(range(1, 13)),
ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
fig.show()
*Hint: select the years needed for comparison.*
If we look at the years before COVID (for instance, 2017 to 2019), we can see that the percentage of restaurants with at least one violation decreases gradually year by year. It seems like passing inspection became easier over the years.
However, if we take 2020 and 2021 into consideration, we notice that the violation proportion rises again. This may be due to PG County taking extra precautions at each restaurant to prevent the spread of COVID, and therefore performing stricter sanitation inspections.
In addition, April tends to have lower violation rates. One possible explanation is simply that fewer inspections take place in April. To check this, we should look at inspection counts over the same period, say from 2017 through 2020.
# Check the number of inspections for April from year 2017 to 2020
for i in range(2017, 2021):
df_year = df_trans[df_trans['Date'].dt.year == i]
print(df_year.groupby('Date')['Establishment_id'].count().sort_values())
Date
2017-04-01    183
2017-12-01    193
2017-09-01    202
2017-01-01    209
2017-02-01    212
2017-11-01    219
2017-10-01    230
2017-03-01    240
2017-05-01    258
2017-07-01    259
2017-08-01    272
2017-06-01    279
Name: Establishment_id, dtype: int64
Date
2018-12-01    125
2018-04-01    127
2018-10-01    128
2018-09-01    144
2018-03-01    152
2018-02-01    154
2018-08-01    157
2018-11-01    168
2018-05-01    193
2018-06-01    193
2018-07-01    206
2018-01-01    243
Name: Establishment_id, dtype: int64
Date
2019-01-01    109
2019-02-01    127
2019-09-01    142
2019-04-01    153
2019-05-01    153
2019-10-01    162
2019-03-01    169
2019-07-01    184
2019-06-01    208
2019-08-01    252
2019-11-01    262
2019-12-01    280
Name: Establishment_id, dtype: int64
Date
2020-04-01      8
2020-11-01     22
2020-05-01     25
2020-10-01     41
2020-12-01     43
2020-09-01     48
2020-08-01     49
2020-06-01     67
2020-07-01    158
2020-03-01    194
2020-02-01    271
2020-01-01    303
Name: Establishment_id, dtype: int64
We can see that this matches our assumption: April consistently has among the fewest inspections, so a low violation rate does not necessarily mean restaurants are performing better. It might simply be the result of a low number of inspections!
Lastly, we want to know whether any particular areas in PG County have more violations. This matters to us as students: we should care about how the restaurants around us perform, since we interact with them the most on a daily basis.
Let's say we want to investigate the performance of all restaurants in PG County this year (2023).
# Extract longitude and latitude for the map visualization
df_trans['Longitude'] = df_trans['Location'].str.extract(r'(-?\d+\.\d+)').astype(float)
df_trans['Latitude'] = df_trans['Location'].str.extract(r' (\d+\.\d+)').astype(float)
# Create a color category column for the map visualization (at least one violation vs. none)
df_trans['Mark'] = df_trans['Number_of_violation'].map(lambda x: 'More than 1 violation' if x >= 1 else 'No violation')
df_trans.head()
| Establishment_id | Name | Category | Inspection_date | Inspection_results | City | State | Zip | Address_line_1 | Address_line_2 | ... | Fast Food - Local | Pizza | Restaurant | School | Seafood | Date | Number_of_violation | Longitude | Latitude | Mark | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29062 | 1088 | PANDA EXPRESS 1793 | Fast Food | 07/21/2011 12:00:00 AM | NaN | BRANDYWINE | MD | 20613.0 | 15902 CRAIN HWY | NaN | ... | 0 | 0 | 0 | 0 | 0 | 2011-07-01 | 0 | -76.875918 | 38.673414 | No violation |
| 16878 | 43 | ANDREWS RESTAURANT | Restaurant | 07/13/2011 12:00:00 AM | NaN | SUITLAND | MD | 20746.0 | 6407 SUITLAND RD | NaN | ... | 0 | 0 | 1 | 0 | 0 | 2011-07-01 | 2 | -76.895252 | 38.829488 | More than 1 violation |
| 10110 | 561 | GAYLORD NATIONAL - OLD HICKORY STEAKHOUSE | Restaurant | 07/07/2011 12:00:00 AM | NaN | OXON HILL | MD | 20745.0 | 201 WATERFRONT ST | NaN | ... | 0 | 0 | 1 | 0 | 0 | 2011-07-01 | 1 | -77.016084 | 38.782417 | More than 1 violation |
| 27477 | 1867 | EDIBLE ARRANGEMENTS | Catering Only | 07/20/2011 12:00:00 AM | NaN | LAUREL | MD | 20707.0 | 13600 BALTIMORE AVE | 204 | ... | 0 | 0 | 0 | 0 | 0 | 2011-07-01 | 0 | -76.868105 | 39.076438 | No violation |
| 11170 | 289 | CHICK-FIL-A | Fast Food | 07/21/2011 12:00:00 AM | NaN | BRANDYWINE | MD | 20613.0 | 16003 CRAIN HWY | NaN | ... | 0 | 0 | 0 | 0 | 0 | 2011-07-01 | 0 | -76.876294 | 38.670276 | No violation |
5 rows × 53 columns
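The two separate extractions above can also be collapsed into one call with two capture groups. A sketch on two example points from the dataset:

```python
import pandas as pd

# Two points in the dataset's WKT-style Location format
loc = pd.Series(['POINT (-76.733979 38.945787)', 'POINT (-76.908332 38.997396)'])

# One extract call with two capture groups returns both coordinates at
# once; note the escaped dot (\.) so it matches a literal decimal point
coords = loc.str.extract(r'POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)').astype(float)
coords.columns = ['Longitude', 'Latitude']
```

Anchoring the pattern on the `POINT (` prefix also guards against accidentally matching a number elsewhere in the string.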
# Create a new dataset that only contains 2023 inspection records
df_2023 = df_trans[df_trans['Date'].dt.year == 2023]
# Create a map that shows all the restaurant in 2023
fig = px.scatter_mapbox(df_2023, lat="Latitude", lon="Longitude", hover_name="Name", hover_data=["Category", "Address_line_1"],
color='Mark', color_discrete_sequence=["blue", "red"], zoom=10, height=500)
fig.update_layout(title="2023 PG County Restaurant Performance", mapbox_style="carto-positron")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
From the map, it looks like the farther we get from urban areas, the more visible restaurants with no violations become. This seems reasonable, as downtown and near-city regions tend to serve more customers every day; therefore, it is easier for those restaurants to neglect sanitation.
Also, if we zoom in a little on the College Park area, where the University of Maryland stands, we can see a fair number of restaurants complying with PG County food inspections. The most plausible reason is that restaurants near campus need to follow not only the PG County standard but also the university's standard.
Although we now have a sense of the distribution of restaurants with at least one violation, we want to dig deeper. We would like to know exactly how severe the violations are and whether there is clustering among them; for instance, which categories of restaurant tend to have more violations. Again, we will use the 2023 data as our investigation sample.
# Create a map that shows all restaurant inspections in 2023
# Warm tones mean higher numbers of violations; cold tones mean lower
px.set_mapbox_access_token("pk.eyJ1Ijoic3llbjEyMzUiLCJhIjoiY2xveXltMzVuMDhoeTJpb3A0a2VlYjM0cyJ9.I4ikCf8sBEMgHP1heJfOKQ")
fig = px.scatter_mapbox(df_2023, lat="Latitude", lon="Longitude", hover_name="Name", hover_data=["Category"], color="Number_of_violation", size="Number_of_violation",
color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)
fig.show()
If we hover over the red and orange circles, restaurant categories like Fast Food and Carry-out seem to have higher violation counts. Also, if we lasso-select the Landover area, Fast Food and Carry-out restaurants appear to cluster in that region. This makes some sense: fast-food and carry-out restaurants focus more on service efficiency and less on the overall quality of food preparation, so it is no surprise that they have a harder time maintaining high sanitary standards.
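The hover-based impression can be checked numerically with a per-category summary. A sketch on toy data (hypothetical values for illustration; on the real data one would group `df_2023` the same way):

```python
import pandas as pd

# Toy 2023-style records (made-up values, for illustration only)
df_demo = pd.DataFrame({
    'Category': ['Fast Food', 'Fast Food', 'Restaurant', 'Carry-out'],
    'Number_of_violation': [3, 1, 0, 1],
})

# Average violations per category, sorted so the worst offenders lead
by_cat = (df_demo.groupby('Category')['Number_of_violation']
          .mean().sort_values(ascending=False))
```

A ranked table like this would make the "fast food violates more" claim testable rather than anecdotal.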
"I pledge on my honor that I have not given nor received any unauthorized assistance on this assignment."
--Sung-Jen, Yen